Quiz: One-Step Dynamics

It will prove convenient to represent the environment's dynamics using mathematical notation. In this concept, we will introduce this notation (which can be used for any reinforcement learning task) and use the recycling robot as an example.

At an arbitrary time step t, the agent-environment interaction has evolved as a sequence of states, actions, and rewards

(S_0, A_0, R_1, S_1, A_1, \ldots, R_{t-1}, S_{t-1}, A_{t-1}, R_t, S_t, A_t).

When the environment responds to the agent at time step t+1, it considers only the state and action at the previous time step (S_t, A_t).

In particular, it does not care what state was presented to the agent more than one step prior. (In other words, the environment does not consider any of { S_0, \ldots, S_{t-1} }.)

It also does not look at the actions that the agent took before the most recent one. (In other words, the environment does not consider any of { A_0, \ldots, A_{t-1} }.)

Furthermore, how well the agent is doing, or how much reward it is collecting, has no effect on how the environment chooses to respond to the agent. (In other words, the environment does not consider any of { R_1, \ldots, R_t }.)
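Taken together, the three observations above say that the full history can be dropped from the conditioning. As a sketch in the notation of this concept (the history is written out explicitly here for illustration only), this reads:

```latex
\mathbb{P}(S_{t+1}=s', R_{t+1}=r \mid S_0, A_0, R_1, \ldots, R_t, S_t = s, A_t = a)
  = \mathbb{P}(S_{t+1}=s', R_{t+1}=r \mid S_t = s, A_t = a)
```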

Because of this, we can completely define how the environment decides the state and reward by specifying

p(s',r|s,a) \doteq \mathbb{P}(S_{t+1}=s', R_{t+1}=r|S_t = s, A_t=a)

for each possible s', r, s, \text{and } a. These conditional probabilities are said to specify the one-step dynamics of the environment.
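One way to make this definition concrete is to store the conditional probabilities in a lookup table. The sketch below is a minimal illustration and not part of the original lesson: the names DYNAMICS, p, and is_normalized are hypothetical, and the table holds only the (high, search) entries from the recycling robot example in this concept. The remaining state-action pairs are left unfilled rather than guessed.

```python
# Sketch: one-step dynamics stored as a lookup table.
# Keys are (s, a); each value maps (s_next, r) -> probability.
# Only the (high, search) entries come from this concept's example;
# other state-action pairs would be added in the same way.
DYNAMICS = {
    ("high", "search"): {("high", 4): 0.7, ("low", 4): 0.3},
}


def p(s_next, r, s, a, dynamics=DYNAMICS):
    """Return p(s', r | s, a); outcomes not listed have probability 0."""
    return dynamics[(s, a)].get((s_next, r), 0.0)


def is_normalized(s, a, dynamics=DYNAMICS, tol=1e-9):
    """Check that the probabilities over all (s', r) sum to 1 for (s, a)."""
    return abs(sum(dynamics[(s, a)].values()) - 1.0) <= tol


print(p("high", 4, "high", "search"))    # listed outcome
print(p("low", 1, "high", "search"))     # unlisted outcome, probability 0
print(is_normalized("high", "search"))
```

For any fixed state-action pair, the probabilities over all possible (s', r) must sum to 1, which is what is_normalized checks.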

An Example

Let's return to the case where S_t = \text{high} and A_t = \text{search}.

Then, when the environment responds to the agent at the next time step,

with 70% probability, the next state is high and the reward is 4. In other words, p(\text{high}, 4|\text{high},\text{search}) = \mathbb{P}(S_{t+1}=\text{high}, R_{t+1}=4|S_{t} = \text{high}, A_{t}=\text{search}) = 0.7.
with 30% probability, the next state is low and the reward is 4. In other words, p(\text{low}, 4|\text{high},\text{search}) = \mathbb{P}(S_{t+1}=\text{low}, R_{t+1}=4|S_{t} = \text{high}, A_{t}=\text{search}) = 0.3.
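The two probabilities above can also be read as a recipe for how the environment generates its response. The following sketch samples the next state and reward for the pair (high, search) many times and reports how often the next state is high. The names OUTCOMES, PROBS, and step are hypothetical, and the seed and sample count are arbitrary choices made for repeatability.

```python
import random

# Outcome distribution for S_t = high, A_t = search, as stated above.
OUTCOMES = [("high", 4), ("low", 4)]
PROBS = [0.7, 0.3]


def step(rng):
    """Sample (next_state, reward) for the state-action pair (high, search)."""
    return rng.choices(OUTCOMES, weights=PROBS, k=1)[0]


rng = random.Random(0)  # fixed seed (arbitrary) so repeated runs agree
samples = [step(rng) for _ in range(10_000)]

# Empirical frequency of landing in the high state; it should be near 0.7.
frac_high = sum(s == "high" for s, _ in samples) / len(samples)
print(f"fraction of next states that are high: {frac_high:.3f}")
```

Note that the sampling uses only the current state and action, never the earlier history, which matches the one-step dynamics described in this concept.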

Question 1

What is p(\text{high}, -3|\text{low},\text{search})?

QUESTION:

Enter the correct numerical value.

SOLUTION:

NOTE: The solutions are expressed as RegEx patterns. Udacity uses these patterns to check the given answer.

Question 2

What is p(\text{high}, 0|\text{low},\text{recharge})?

QUESTION:

Enter the correct numerical value.

SOLUTION:

Questions 3 and 4

Consider the following probabilities:

(1) p(\text{low}, 1|\text{low},\text{search})
(2) p(\text{high}, 0|\text{low},\text{recharge})
(3) p(\text{high}, 1|\text{low},\text{wait})
(4) p(\text{high}, 1|\text{high},\text{wait})
(5) p(\text{high}, 1|\text{high},\text{search})

SOLUTION (Question 3):

(1)
(3)
(5)

SOLUTION (Question 4):

(2)
(4)